The Dataset

Description

The following acoustic properties of each voice are measured and included within the CSV:

Accuracy of various ML methods (train / test)

Baseline (always predict male): 50% / 50%

Logistic Regression: 97% / 98%

CART: 96% / 97%

Random Forest: 100% / 98%

SVM: 100% / 99%

XGBoost: 100% / 99%

My aims

My aim is to build a model that predicts the label (male/female) as accurately as possible

What are the most important features that differ between male and female voices?

Imports

Read the data
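A minimal sketch of the loading step. The actual notebook presumably reads the Kaggle CSV (the file name "voice.csv" is an assumption); here a tiny in-memory sample stands in so the snippet runs on its own.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the real file; in practice: df = pd.read_csv("voice.csv")
sample_csv = StringIO(
    "meanfreq,sd,meanfun,label\n"
    "0.18,0.06,0.084,male\n"
    "0.21,0.04,0.170,female\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)   # (rows, columns)
print(df.dtypes)  # acoustic features are floats, label is an object column
```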

Descriptive stats
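One call covers the descriptive statistics; a sketch on a synthetic frame (the column names follow the dataset, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "meanfreq": [0.18, 0.21, 0.19],
    "meanfun": [0.084, 0.170, 0.095],
})
stats = df.describe()  # count, mean, std, min, quartiles, max per column
print(stats.loc[["mean", "std"]])
```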

Is label balanced?
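Class balance can be checked with `value_counts`. The labels below are synthetic; in the actual Kaggle dataset the split is reportedly even (1,584 voices per gender), which is why the baseline above sits at 50%.

```python
import pandas as pd

label = pd.Series(["male"] * 3 + ["female"] * 3)  # synthetic stand-in column
counts = label.value_counts()
shares = label.value_counts(normalize=True)
print(counts)
print(shares)  # both classes at 0.5 -> perfectly balanced
```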

NaN values
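A per-column NaN count is the usual check here; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "meanfun": [0.08, np.nan, 0.17],
    "label": ["male", "male", "female"],
})
na_per_column = df.isna().sum()
print(na_per_column)          # meanfun: 1, label: 0
print(df.isna().any().any())  # True if any NaN exists anywhere
```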

EDA & Preprocessing

Pairplot

Correlations heatmap
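The heatmap rests on the pairwise correlation matrix. A self-contained sketch with synthetic columns; the `centroid` copy mimics the perfect correlation with `meanfreq` noted below, and the seaborn call the notebook presumably used is shown as a comment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "meanfreq": x,
    "centroid": x,                    # identical copy -> correlation of 1.0
    "meanfun": rng.normal(size=200),  # independent feature
})
corr = df.corr()
print(corr.round(2))
# Drawn as a heatmap in the notebook, e.g.:
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```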

Zoom in some pairs

The first 8 variables describe the distribution of the voices' frequencies, and they appear to be strongly correlated with one another

So we can drop skew, median, SD, Q25, Q75 and IQR

At the same time, IQR, SD and Q25 seem to predict gender quite well on their own

The next group of voice characteristics

meanfun seems to be a really good predictor! And it is not correlated with any other variable

This group also shows a lot of correlations

So, we can remove sfm and, say, maxdom and dfrange

centroid is perfectly correlated with meanfreq, and also correlated with median, SD, Q25, Q75 and IQR

Then we can drop centroid

So, I will drop the following vars: centroid, median, maxdom, dfrange, sfm, sd
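The drop itself is a one-liner; sketched on an empty frame that only mirrors the dataset's column names:

```python
import pandas as pd

to_drop = ["centroid", "median", "maxdom", "dfrange", "sfm", "sd"]
df = pd.DataFrame(columns=[
    "meanfreq", "sd", "median", "Q25", "Q75", "IQR",
    "sfm", "centroid", "meanfun", "maxdom", "dfrange", "label",
])
df_reduced = df.drop(columns=to_drop)  # removes the six collinear variables
print(list(df_reduced.columns))
```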

Quick tree with all variables
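A quick tree baseline can look like the sketch below; since the CSV isn't bundled here, a synthetic classification problem stands in for the voice features, and the depth cap is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice features and labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X, y)
print(f"train accuracy: {tree.score(X, y):.2f}")
```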

Convert label to 1 and 0
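The encoding is a simple `map`; which class gets 1 is a convention (male = 1 here is an assumption):

```python
import pandas as pd

labels = pd.Series(["male", "female", "female", "male"])
y = labels.map({"male": 1, "female": 0})  # binary target for the classifiers
print(y.tolist())  # [1, 0, 0, 1]
```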

Split to train and test datasets
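A standard stratified split; the 75/25 ratio and the random seed are assumptions, and synthetic data again stands in for the CSV:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
# stratify=y keeps the male/female ratio equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)  # (75, 5) (25, 5)
```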

Tune hyper-parameters with k-fold cross-validation (k = 5)

{'n_estimators': 30, 'min_samples_split': 7, 'min_samples_leaf': 6, 'min_impurity_decrease': 0.0, 'max_depth': 8, 'criterion': 'gini', 'ccp_alpha': 0.0}
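The dictionary above is the shape of a `best_params_` result for a random forest. The search method isn't stated; a `RandomizedSearchCV` with `cv=5` is one plausible sketch (the search space and `n_iter` below are illustrative, not the notebook's actual values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
param_distributions = {
    "n_estimators": [10, 30, 50],
    "max_depth": [4, 8, None],
    "min_samples_split": [2, 7],
    "min_samples_leaf": [1, 6],
    "criterion": ["gini", "entropy"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=5,  # k-fold cross-validation with k = 5
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```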

Prediction

Quality of prediction metrics and plots
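The usual trio for a binary classifier: accuracy, confusion matrix, and per-class report. The labels below are made up to keep the sketch self-contained.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_test = [1, 0, 1, 1, 0, 0]  # illustrative true labels
y_pred = [1, 0, 1, 0, 0, 0]  # illustrative predictions (one miss)
print(accuracy_score(y_test, y_pred))    # 5/6 ~ 0.83
print(confusion_matrix(y_test, y_pred))  # rows: true class, cols: predicted
print(classification_report(y_test, y_pred))
```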

Feature importance
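Random forests expose impurity-based importances after fitting; a sketch on synthetic data (the real notebook would read these off the tuned model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=200, n_features=6, n_informative=2, random_state=0
)
rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
importances = rf.feature_importances_  # non-negative, sums to 1
print(np.round(importances, 3))
order = np.argsort(importances)[::-1]  # features ranked most to least important
print("most important feature index:", order[0])
```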

Correlated features 'split' the importance between themselves, so each one's individual score is understated

Tree without collinear vars

As the scatter plots suggested, meanfun is the most important variable. Removing the collinear variables didn't improve the model's performance, but it did make computation faster.

Logistic Regression

With a properly chosen C parameter, LR works almost as well as RF
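One way to choose C is a small grid search; the C grid and the scaling step are assumptions (scaling is advisable because logistic regression's L2 penalty is scale-sensitive), sketched again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
# Scale inside the pipeline so the CV folds don't leak test statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(
    pipe, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5
)
grid.fit(X, y)
print(grid.best_params_)
print(f"CV accuracy: {grid.best_score_:.3f}")
```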